Summary

Construction methods for prior densities are investigated from a predictive viewpoint.
Predictive densities for future observables are constructed by using observed data. The
simultaneous distribution of future observables and observed data is assumed to belong to a
parametric submodel of a multinomial model. Future observables and data are possibly dependent.
The discrepancy of a predictive density to the true conditional density of future observables
given observed data is evaluated by the Kullback-Leibler divergence. It is proved that limits of
Bayesian predictive densities form an essentially complete class. Latent information priors are
defined as priors maximizing the conditional mutual information between the parameter and the
future observables given the observed data. Minimax predictive densities are constructed as
limits of Bayesian predictive densities based on prior sequences converging to the latent
information priors.

1. Introduction

We construct predictive densities for future observables by using observed data. Future
observables and data are possibly dependent and the simultaneous distribution of them is assumed
to belong to a submodel of a multinomial model. Various practically important models such as
categorical models and graphical models are included in this class.

Let $\mathcal{X}$ and $\mathcal{Y}$ be finite sets composed of $k$ and $l$ elements, and let $x$ and $y$ be random variables
that take values in $\mathcal{X}$ and $\mathcal{Y}$, respectively. Let $\mathcal{M} = \{p(x,y|\theta) \mid \theta \in \Theta\}$ be a set of probability
densities on $\mathcal{X} \times \mathcal{Y}$. The model $\mathcal{M}$ is regarded as a submodel of the $kl$-nomial model with trial
number 1. Here, we do not lose generality by assuming the trial number is 1. The model $\mathcal{M}$ is
naturally regarded as a subset of the hyperplane $\{p = (p_{ij}) \mid \sum_{i=1}^{k}\sum_{j=1}^{l} p_{ij} = 1\}$ in Euclidean space
$\mathbb{R}^{kl}$. In the following, we identify $\Theta$ with $\mathcal{M}$. Then, the parameter space $\Theta$ is endowed with the
induced topology as a subset of $\mathbb{R}^{kl}$.

A predictive density $q(y;x)$ is defined as a function from $\mathcal{X} \times \mathcal{Y}$ to $[0,1]$ satisfying
$\sum_{y \in \mathcal{Y}} q(y;x) = 1$ $(x \in \mathcal{X})$. The closeness of $q(y;x)$ to the true conditional probability density
$p(y|x,\theta)$ is evaluated by the average Kullback-Leibler divergence:

$$R(\theta, q) = \sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{q(y;x)}, \tag{1}$$

where we define $c \log 0 = -\infty$ $(c > 0)$, $0 \log 0 = 0$, $0 \log (c/0) = 0$ $(c > 0)$. Although the
conditional probability $p(y|x,\theta)$ is not uniquely defined when $p(x|\theta) = 0$, the risk value $R(\theta,q)$
is uniquely determined because $p(x,y|\theta) \log p(y|x,\theta) = 0$ if $p(x|\theta) = 0$.
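The risk (1) and its conventions for zero probabilities can be sketched in a few lines of Python. This is only an illustration under assumed names: the function `kl_risk` and the nested-list layout `p_joint[x][y]`, `q[x][y]` are hypothetical and do not come from the paper.

```python
import math

def kl_risk(p_joint, q):
    """Average Kullback-Leibler risk (1) for one fixed parameter value.

    p_joint[x][y] holds p(x, y | theta); q[x][y] holds q(y; x).
    Both layouts are assumptions made for this sketch.
    """
    risk = 0.0
    for x, row in enumerate(p_joint):
        p_x = sum(row)  # marginal p(x | theta)
        for y, p_xy in enumerate(row):
            if p_xy == 0.0:
                continue  # cells with zero probability contribute 0
            if q[x][y] == 0.0:
                return math.inf  # positive probability where q vanishes
            risk += p_xy * math.log((p_xy / p_x) / q[x][y])
    return risk

# Two binary variables, independent and uniform under theta.
p_theta = [[0.25, 0.25], [0.25, 0.25]]
print(kl_risk(p_theta, [[0.5, 0.5], [0.5, 0.5]]))  # q equals p(y | x, theta)
print(kl_risk(p_theta, [[1.0, 0.0], [0.5, 0.5]]))  # q excludes a possible outcome
```

The early `continue` implements the conventions $0 \log 0 = 0$ and $0 \log(c/0) = 0$, and the infinite return value implements $c \log 0 = -\infty$ for $c > 0$.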

First, we show that, for every predictive density $q(y;x)$, there exists a limit $\lim_{n\to\infty} p_{\pi_n}(y|x)$ of
Bayesian predictive densities

$$p_{\pi_n}(y|x) := \frac{\int p(x,y|\theta)\,d\pi_n(\theta)}{\int p(x|\theta)\,d\pi_n(\theta)},$$

where $\{\pi_n\}_{n=1}^{\infty}$ is a prior sequence, such that $R(\theta, \lim_{n\to\infty} p_{\pi_n}(y|x)) \le R(\theta, q(y;x))$ for every $\theta \in \Theta$.
In the terminology of statistical decision theory, this means that the class of predictive densities
that are limits of Bayesian predictive densities is an essentially complete class.

Next, we investigate latent information priors defined as priors maximizing the conditional
mutual information between $y$ and $\theta$ given $x$. We obtain a method for constructing a prior
sequence $\{\pi_n\}_{n=1}^{\infty}$ converging to the latent information prior, based on which a minimax predictive
density $\lim_{n\to\infty} p_{\pi_n}(y|x)$ is obtained. We consider limits of Bayesian predictive densities to deal
with conditional probabilities.

There exist important previous studies on prior construction by using the unconditional
mutual information. The reference prior by Bernardo (1979), (2005) is a prior maximizing the
mutual information between $\theta$ and $y$ in the limit as the amount of information of $y$ goes to
infinity. It corresponds to the Jeffreys prior if there are no nuisance parameters; see Ibragimov
and Hasminskii (1973) and Clarke and Barron (1994) for rigorous treatments. In coding theory,
the prior maximizing the mutual information between $y$ and $\theta$ is used for Bayes coding. It
was shown by Gallager (1979) and Davisson and Leon-Garcia (1980) that the Bayes codes for
finite alphabet models based on the priors are minimax. In our framework, these settings
correspond to prediction of $y$ without $x$. In statistical applications, $x$ plays an important role
because it corresponds to observed data, although $\mathcal{X}$ is an empty set in the reference analysis and
the standard framework of information theory; see also Komaki (2004) for the relation between
statistical prediction and Bayes coding.

Geisser (1978), in the discussion of Bernardo (1978), discussed minimax prediction based on
the risk function (1) as an alternative to the reference prior approach.

The latent information priors introduced in the present paper bridge these two approaches. 
The theorems obtained below clarify the relation between the conditional mutual information 
and minimax prediction based on observed data. 

For Bayesian prediction of future observables by using observed data, Akaike (1983) discussed
priors maximizing the mutual information between $x$ and $y$ and called them minimum information
priors. Kuboki (1998) also proposed priors for Bayesian prediction based on an
information-theoretic quantity. These priors are different from the latent information priors
investigated in the present paper.

In section 2, we prove that, for every predictive density $q(y;x)$, there exists a predictive
density that is a limit of Bayesian predictive densities whose performance is not worse than that of
$q(y;x)$. In section 3, we introduce a construction method for minimax predictive densities as
limits of Bayesian predictive densities. The method is based on the conditional mutual information
between $y$ and $\theta$ given $x$. In section 4, we give some numerical results and discussions.

2. Limits of Bayesian predictive densities 

In this section, we prove that the class of predictive densities that are limits of Bayesian predictive
densities is an essentially complete class.

Throughout this paper, we assume the following conditions:

Assumption 1. $\Theta$ is compact.

Assumption 2. For every $x \in \mathcal{X}$, there exists $\theta \in \Theta$ such that $p(x|\theta) > 0$.

These assumptions are not restrictive. For Assumption 1, if $\Theta$ is not compact, we can regard the
closure $\bar{\Theta}$ as the parameter space instead of $\Theta$ because we consider a submodel of a multinomial
model. We do not lose generality by Assumption 2 because we can adopt $\mathcal{X} \setminus \{x_0\}$ instead of $\mathcal{X}$
if there exists $x_0 \in \mathcal{X}$ such that $p(x_0|\theta) = 0$ for every $\theta \in \Theta$.

We prepare several preliminary results to prove Theorem 1 below. 

Let $\mathcal{P}$ be the set of all probability measures on $\Theta$ endowed with the weak convergence
topology and the corresponding Borel algebra. By the Prohorov theorem and Assumption 1, $\mathcal{P}$
is compact.

When $x$ and $y$ are fixed, the function $\theta \in \Theta \mapsto p(x,y|\theta) \in [0,1]$ is bounded and continuous.
Thus, for every fixed $(x,y) \in \mathcal{X} \times \mathcal{Y}$, the function

$$\pi \in \mathcal{P} \mapsto p_\pi(x,y) := \int p(x,y|\theta)\,d\pi(\theta)$$

is continuous, because of the definition of weak convergence. Therefore, for every predictive
density $q(y;x)$, the function from $\mathcal{P}$ to $[0,\infty]$ defined by

$$\begin{aligned}
D_q(\pi) &:= \sum_{x,y} p_\pi(x,y) \log \frac{p_\pi(x,y)}{q(y;x)\,p_\pi(x)} \\
&= \sum_{x,y} p_\pi(x,y) \log p_\pi(x,y) - \sum_{x} p_\pi(x) \log p_\pi(x) - \sum_{(x,y):\,q(y;x)>0} p_\pi(x,y) \log q(y;x) \\
&\quad - \sum_{(x,y):\,q(y;x)=0} p_\pi(x,y) \log q(y;x)
\end{aligned} \tag{2}$$

is lower semicontinuous, because the last term in (2) is lower semicontinuous and the other terms
are continuous.

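Lemma 1. Let $\mu$ be a probability measure on $\Theta$. Then, $\mathcal{P}_{\varepsilon\mu} := \{\varepsilon\mu + (1-\varepsilon)\pi \mid \pi \in \mathcal{P}\}$
$(0 \le \varepsilon \le 1)$ is a closed subset of $\mathcal{P}$.

Proof. Suppose that $\pi_\infty \in \mathcal{P}$ is the limit of a convergent sequence $\{\pi_k\}_{k=1}^{\infty}$ in $\mathcal{P}_{\varepsilon\mu}$. Since $\pi_k \in \mathcal{P}_{\varepsilon\mu}$, we have

$$\int f(\theta)\,d\pi_k(\theta) - \varepsilon \int f(\theta)\,d\mu(\theta) \ge 0$$

for every nonnegative bounded continuous function $f(\theta)$ on $\Theta$. Thus,

$$\int f(\theta)\,d\pi_\infty(\theta) = \lim_{k\to\infty} \int f(\theta)\,d\pi_k(\theta) \ge \varepsilon \int f(\theta)\,d\mu(\theta).$$

Hence, $\pi_\infty - \varepsilon\mu$ is a nonnegative measure. Therefore, $\pi_\infty \in \mathcal{P}_{\varepsilon\mu}$ and $\mathcal{P}_{\varepsilon\mu}$ is a closed set in $\mathcal{P}$. □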

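Lemma 2. Let $f(\cdot)$ be a continuous function from $\mathcal{P}$ to $[0,\infty]$, and let $\mu$ be a probability measure
on $\Theta$ such that $p_\mu(x) := \int p(x|\theta)\,d\mu(\theta) > 0$ for every $x \in \mathcal{X}$. Then, there is a probability measure
$\pi_n$ in

$$\mathcal{P}_{\mu/n} := \left\{ \frac{1}{n}\mu + \Bigl(1 - \frac{1}{n}\Bigr)\pi \;\Big|\; \pi \in \mathcal{P} \right\} \quad (n = 1,2,3,\ldots)$$

such that $f(\pi_n) = \inf_{\pi \in \mathcal{P}_{\mu/n}} f(\pi)$. Furthermore, there exists a convergent subsequence $\{\pi'_m\}_{m=1}^{\infty}$ of
$\{\pi_n\}_{n=1}^{\infty}$ such that the equality $f(\pi'_\infty) = \inf_{\pi \in \mathcal{P}} f(\pi)$ holds, where $\pi'_\infty = \lim_{m\to\infty} \pi'_m$.

Proof. Note that there exists $\mu \in \mathcal{P}$ such that $p_\mu(x) := \int p(x|\theta)\,d\mu(\theta) > 0$ for every $x \in \mathcal{X}$
by Assumption 2. By Lemma 1, the sets $\mathcal{P}_{\mu/n}$ $(n = 1,2,3,\ldots)$ are compact because they are
closed subsets of a compact set $\mathcal{P}$. Thus, there is a probability measure $\pi_n$ in $\mathcal{P}_{\mu/n}$ such that
$f(\pi_n) = \inf_{\pi \in \mathcal{P}_{\mu/n}} f(\pi)$. There exists a convergent subsequence $\{\pi'_m\}_{m=1}^{\infty}$ of $\{\pi_n\}_{n=1}^{\infty}$ because $\mathcal{P}$ is
compact.

Since $\mathcal{P}$ is compact and $f(\pi)$ is a continuous function of $\pi \in \mathcal{P}$, there exists $\hat\pi \in \mathcal{P}$ such that
$f(\hat\pi) = \inf_{\pi \in \mathcal{P}} f(\pi)$. Thus, $f(\pi'_\infty) \ge f(\hat\pi)$, where $\pi'_\infty := \lim_{m\to\infty} \pi'_m$. For every $\varepsilon > 0$, there exists
$\delta > 0$ such that $\sup_{d(\pi,\hat\pi)<\delta} f(\pi) < f(\hat\pi) + \varepsilon$, where $d$ is the Prohorov metric on $\mathcal{P}$. We put

$$\tilde\pi_n = \frac{1}{n}\mu + \frac{n-1}{n}\hat\pi \quad (n = 1,2,3,\ldots).$$

Then, $\tilde\pi_n \in \mathcal{P}_{\mu/n}$ and $\lim_{n\to\infty} \tilde\pi_n = \hat\pi$. Thus, for every $\delta > 0$, there exists a positive integer $N$ such
that $d(\hat\pi, \tilde\pi_n) < \delta$ $(n \ge N)$. If $n \ge N$, then $f(\pi'_\infty) \le f(\pi_n) \le f(\tilde\pi_n) < f(\hat\pi) + \varepsilon$. Since $\varepsilon > 0$ is
arbitrary, we have $f(\pi'_\infty) \le f(\hat\pi)$. Therefore, $f(\pi'_\infty) = f(\hat\pi)$. □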

The conditional probability $p_\pi(y|x)$ is not uniquely specified if $p_\pi(x) = 0$. To resolve the
problem, we consider a sequence of priors $\{\pi_n\}_{n=1}^{\infty}$ that satisfies $p_{\pi_n}(x) > 0$ for every $n$ and
$x \in \mathcal{X}$. In the following, $\lim_{n\to\infty} p_{\pi_n}(y|x)$ is defined to be a map from $(x,y) \in \mathcal{X} \times \mathcal{Y}$ to the limit
of the real number sequence $\{p_{\pi_n}(y|x)\}_{n=1}^{\infty}$. If the limits of the sequences of real numbers
$\{p_{\pi_n}(y|x)\}_{n=1}^{\infty}$ exist for all $(x,y) \in \mathcal{X} \times \mathcal{Y}$, we say the limit $\lim_{n\to\infty} p_{\pi_n}(y|x)$ of Bayesian predictive
densities exists. Obviously, if the limit $\lim_{n\to\infty} p_{\pi_n}(y|x)$ exists, it is a predictive density because
$0 \le \lim_{n\to\infty} p_{\pi_n}(y|x) \le 1$ for every $(x,y) \in \mathcal{X} \times \mathcal{Y}$ and $\sum_{y} \lim_{n\to\infty} p_{\pi_n}(y|x) = 1$ for every $x \in \mathcal{X}$.

Theorem 1.

1) For every predictive density $q(y;x)$, there exists a convergent prior sequence $\{\pi_n\}_{n=1}^{\infty}$ such
that the limit $\lim_{n\to\infty} p_{\pi_n}(y|x)$ exists and $R(\theta, \lim_{n\to\infty} p_{\pi_n}(y|x)) \le R(\theta, q(y;x))$ for every $\theta \in \Theta$.

2) If there exists $\hat\pi \in \mathcal{P}$ such that $D_q(\hat\pi) = \inf_{\pi \in \mathcal{P}} D_q(\pi)$ and $p_{\hat\pi}(x) > 0$ for every $x \in \mathcal{X}$, then
$R(\theta, p_{\hat\pi}(y|x)) \le R(\theta, q(y;x))$ for every predictive density $q(y;x)$ and $\theta \in \Theta$.



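Proof. 1) Let $\mathcal{N}_q := \{(x,y) \in \mathcal{X} \times \mathcal{Y} \mid q(y;x) = 0\}$ and $\Theta_q := \{\theta \in \Theta \mid \sum_{(x,y) \in \mathcal{N}_q} p(x,y|\theta) = 0\}$.
Let $\mathcal{P}_q$ be the set of all probability measures on $\Theta_q$. Then, $\Theta_q$ and $\mathcal{P}_q$ are compact subsets of
$\Theta$ and $\mathcal{P}$, respectively.

If $\Theta_q = \emptyset$, the assertion is obvious, because $R(\theta, q(y;x)) = \infty$ for $\theta \notin \Theta_q$. We assume that
$\Theta_q \ne \emptyset$ in the following. Let $\mathcal{X}_q := \{x \in \mathcal{X} \mid \exists\,\theta \in \Theta_q \text{ such that } p(x|\theta) > 0\}$ and $\mu_q$ be a
probability measure on $\Theta_q$ such that $p_{\mu_q}(x) := \int p(x|\theta)\,d\mu_q(\theta) > 0$ for every $x \in \mathcal{X}_q$.

Then, because $D_q(\pi)$ defined by (2) as a function of $\pi \in \mathcal{P}_q$ is continuous, there exists
$\pi_n \in \mathcal{P}_{q,\mu_q/n} := \{(1/n)\mu_q + (1 - 1/n)\pi \mid \pi \in \mathcal{P}_q\}$ such that $D_q(\pi_n) = \inf_{\pi \in \mathcal{P}_{q,\mu_q/n}} D_q(\pi)$. From Lemma
2, there exists a convergent subsequence $\{\pi'_m\}_{m=1}^{\infty}$ of $\{\pi_n\}_{n=1}^{\infty}$ such that $D_q(\pi'_\infty) = \inf_{\pi \in \mathcal{P}_q} D_q(\pi)$,
where $\pi'_\infty = \lim_{m\to\infty} \pi'_m$.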

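Let $n_m$ be the integer satisfying $\pi'_m = \pi_{n_m}$. We can take a subsequence $\{\pi'_m\}_{m=1}^{\infty}$ such that
$0 < n_m/(n_{m+1} - n_m) < c$ for some positive constant $c$. Since

$$\frac{n_m}{n_{m+1}}\,\pi'_m + \Bigl(1 - \frac{n_m}{n_{m+1}}\Bigr)\delta_\theta \in \mathcal{P}_{q,\mu_q/n_{m+1}}$$

for every $\theta \in \Theta_q$, where $\delta_\theta$ is the probability measure on $\Theta_q$ satisfying $\delta_\theta(\{\theta\}) = 1$, we have

$$\tilde\pi_{m,\theta,u} := u\left\{\frac{n_m}{n_{m+1}}\,\pi'_m + \Bigl(1 - \frac{n_m}{n_{m+1}}\Bigr)\delta_\theta\right\} + (1-u)\,\pi'_{m+1} \in \mathcal{P}_{q,\mu_q/n_{m+1}}$$

for every $\theta \in \Theta_q$ and $0 \le u \le 1$. Thus,

$$\begin{aligned}
\frac{\partial}{\partial u} D_q(\tilde\pi_{m,\theta,u})\Big|_{u=0}
&= \sum_{(x,y) \notin \mathcal{N}_q} \frac{\partial}{\partial u}\, p_{\tilde\pi_{m,\theta,u}}(x,y)\Big|_{u=0} \log \frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)} \\
&= \frac{n_m}{n_{m+1}} \sum_{(x,y) \notin \mathcal{N}_q} p_{\pi'_m}(x,y) \log \frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)}
 - \sum_{(x,y) \notin \mathcal{N}_q} p_{\pi'_{m+1}}(x,y) \log \frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)} \\
&\quad + \Bigl(1 - \frac{n_m}{n_{m+1}}\Bigr) \sum_{(x,y) \notin \mathcal{N}_q} p(x,y|\theta) \log \frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)} \ \ge\ 0.
\end{aligned}$$

Hence,

$$\begin{aligned}
\sum_{(x,y) \notin \mathcal{N}_q} p(x,y|\theta) \log \frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)}
&\ge \frac{n_{m+1}}{n_{m+1} - n_m} \sum_{(x,y) \notin \mathcal{N}_q} p_{\pi'_{m+1}}(x,y) \log \frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)} \\
&\quad - \frac{n_m}{n_{m+1} - n_m} \sum_{(x,y) \notin \mathcal{N}_q} p_{\pi'_m}(x,y) \log \frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)} \\
&= \frac{n_{m+1}}{n_{m+1} - n_m} \sum_{(x,y) \notin \mathcal{N}_q} p_{\pi'_{m+1}}(x,y) \log \frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)} \\
&\quad - \frac{n_m}{n_{m+1} - n_m} \Biggl\{ \sum_{(x,y) \notin \mathcal{N}_{\pi'_\infty}} p_{\pi'_m}(x,y) \log \frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)} \\
&\qquad + \sum_{(x,y) \in \mathcal{N}_{\pi'_\infty} \setminus \mathcal{N}_q} p_{\pi'_m}(x,y) \log p_{\pi'_{m+1}}(y|x)
 - \sum_{(x,y) \in \mathcal{N}_{\pi'_\infty} \setminus \mathcal{N}_q} p_{\pi'_m}(x,y) \log q(y;x) \Biggr\} \\
&\ge \frac{n_{m+1}}{n_{m+1} - n_m} \sum_{(x,y) \notin \mathcal{N}_q} p_{\pi'_{m+1}}(x,y) \log \frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)} \\
&\quad - \frac{n_m}{n_{m+1} - n_m} \Biggl\{ \sum_{(x,y) \notin \mathcal{N}_{\pi'_\infty}} p_{\pi'_m}(x,y) \log \frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)}
 - \sum_{(x,y) \in \mathcal{N}_{\pi'_\infty} \setminus \mathcal{N}_q} p_{\pi'_m}(x,y) \log q(y;x) \Biggr\},
\end{aligned} \tag{3}$$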

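where $\mathcal{N}_{\pi'_\infty} := \{(x,y) \in \mathcal{X} \times \mathcal{Y} \mid p_{\pi'_\infty}(x,y) = 0\}$. Here, we have

$$\lim_{m\to\infty} \sum_{(x,y) \notin \mathcal{N}_{\pi'_\infty}} p_{\pi'_m}(x,y) \log \frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)}
= \sum_{(x,y) \notin \mathcal{N}_{\pi'_\infty}} p_{\pi'_\infty}(x,y) \log \frac{p_{\pi'_\infty}(x,y)}{q(y;x)\,p_{\pi'_\infty}(x)}, \tag{4}$$

because $p_{\pi'_\infty}(x,y) > 0$ for every $(x,y) \notin \mathcal{N}_{\pi'_\infty}$, and

$$\lim_{m\to\infty} \sum_{(x,y) \in \mathcal{N}_{\pi'_\infty} \setminus \mathcal{N}_q} p_{\pi'_m}(x,y) \log q(y;x) = 0
= - \sum_{(x,y) \in \mathcal{N}_{\pi'_\infty} \setminus \mathcal{N}_q} p_{\pi'_\infty}(x,y) \log \frac{p_{\pi'_\infty}(x,y)}{q(y;x)\,p_{\pi'_\infty}(x)}. \tag{5}$$

Therefore, from (3), (4), (5), and $0 < n_m/(n_{m+1} - n_m) < c$, for every $\theta \in \Theta_q$,

$$\liminf_{m\to\infty} \sum_{(x,y) \notin \mathcal{N}_q} p(x,y|\theta) \log \frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)} \ge 0. \tag{6}$$

By taking an appropriate subsequence $\{\pi''_k\}_{k=1}^{\infty}$ of $\{\pi'_m\}_{m=1}^{\infty}$, we can make the sequences of
real numbers $\{p_{\pi''_k}(y|x)\}_{k=1}^{\infty}$ converge for all $(x,y) \in \mathcal{X}_q \times \mathcal{Y}$ because $p_{\pi''_k}(x) > 0$ $(x \in \mathcal{X}_q)$ and
$0 \le p_{\pi'_m}(x,y)/p_{\pi'_m}(x) \le 1$.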

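Then, from (6), if $\theta \in \Theta_q$,

$$R\bigl(\theta, \lim_{k\to\infty} p_{\pi''_k}(y|x)\bigr)
= \sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{\lim_{k\to\infty} p_{\pi''_k}(y|x)}
\le \sum_{(x,y) \notin \mathcal{N}_q} p(x,y|\theta) \log \frac{p(y|x,\theta)}{q(y;x)}
= \sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{q(y;x)}
= R(\theta, q(y;x)) < \infty.$$

Note that the risk $R(\theta, \lim_{k\to\infty} p_{\pi''_k}(y|x))$ does not depend on the choice of $\lim_{k\to\infty} p_{\pi''_k}(y|x)$ for $x \notin \mathcal{X}_q$,
although $\lim_{k\to\infty} p_{\pi''_k}(y|x)$ is not uniquely determined for such $x$.

If $\theta \notin \Theta_q$, $R(\theta, q(y;x)) = \infty$ because $-\sum_{(x,y) \in \mathcal{N}_q} p(x,y|\theta) \log q(y;x) = \infty$. For $x \notin \mathcal{X}_q$, $p(x|\theta) > 0$
only when $\theta \notin \Theta_q$. Thus, if $x \notin \mathcal{X}_q$ is observed, then $R(\theta, q(y;x)) = \infty$ because $\theta \notin \Theta_q$.
Hence, the risk of the predictive density defined by

$$\begin{cases}
\lim_{k\to\infty} p_{\pi''_k}(y|x), & x \in \mathcal{X}_q, \\
r(y;x), & x \notin \mathcal{X}_q,
\end{cases}$$

where $r(y;x)$ is an arbitrary predictive density, is not greater than that of $q(y;x)$ for every $\theta \in \Theta$.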

Therefore, by taking a sequence $\{\varepsilon_k \in (0,1)\}_{k=1}^{\infty}$ that converges rapidly enough to 0, we can
construct a predictive density

$$\lim_{k\to\infty} p_{\varepsilon_k \mu + (1-\varepsilon_k)\pi''_k}(y|x) =
\begin{cases}
\lim_{k\to\infty} p_{\pi''_k}(y|x), & x \in \mathcal{X}_q, \\
p_\mu(y|x), & x \notin \mathcal{X}_q,
\end{cases} \tag{7}$$

as a limit of Bayesian predictive densities based on priors $\varepsilon_k \mu + (1-\varepsilon_k)\pi''_k$, where $\mu$ is a measure
on $\Theta$ such that $p_\mu(x) > 0$ for every $x \in \mathcal{X}$.

Hence, the risk of the predictive density (7) is not greater than that of $q(y;x)$ for every
$\theta \in \Theta$.



2) In this case, the proof becomes much simpler. We assume that $\Theta_q \ne \emptyset$ because the assertion
is obvious if $\Theta_q = \emptyset$. Then, $D_q(\hat\pi) < \infty$ and $\hat\pi(\Theta_q) = 1$. Thus, we can set $\mu_q = \hat\pi$ in the proof of
1). Furthermore, we can set $\mu = \hat\pi$ because $p_{\hat\pi}(x) > 0$ for every $x \in \mathcal{X}$. Therefore, the desired
result can be proved without considering limits of Bayesian predictive densities. □



We give two simple examples to clarify the meaning of Theorem 1 and its proof. 

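Example 1. Suppose that $\mathcal{X} = \{0,1,2\}$, $\mathcal{Y} = \{0,1\}$, $p(x,y|\theta) = \binom{2}{x}\theta^x(1-\theta)^{2-x}\,\theta^y(1-\theta)^{1-y}$,
and $\Theta = [0,1]$. Let $q(y;x) = (x/2)^y(1 - x/2)^{1-y}$, which is the plug-in predictive density
with the maximum likelihood estimate $\hat\theta = x/2$. Then, $\mathcal{N}_q = \{(0,1),(2,0)\}$, $\Theta_q = \{0,1\}$, and
$\mathcal{X}_q = \{0,2\}$. The prior defined by $\pi^{(w)} := w\delta_0 + (1-w)\delta_1 \in \mathcal{P}_q$ $(0 < w < 1)$ satisfies

$$D_q(\pi^{(w)}) = \inf_{\pi \in \mathcal{P}_q} D_q(\pi) = 0.$$

We set $\mu_q = \pi^{(w)}$, which satisfies $p_{\mu_q}(x) > 0$ for $x \in \mathcal{X}_q$. Then, we can set $\pi_n = \pi^{(w)}$ $(n =
1,2,3,\ldots)$ because $\pi^{(w)} \in \mathcal{P}_{q,\mu_q/n}$ and $D_q(\pi^{(w)}) = 0$. Then, $\lim_{n\to\infty} p_{\pi_n}(y|x) = p_{\pi^{(w)}}(y|x)$. Thus,
$\pi'_\infty = \pi^{(w)}$ and $\mathcal{N}_{\pi'_\infty} = \mathcal{N}_q$.

The prior $\pi^{(w)}$ does not specify the conditional density $p_{\pi^{(w)}}(y|x=1)$ because $p_{\pi^{(w)}}(x=1) = 0$.
We set $\mu(d\theta) = d\theta$ and

$$\pi''_k = \frac{1}{k}\mu + \Bigl(1 - \frac{1}{k}\Bigr)\pi^{(w)}.$$

Then, $\lim_{k\to\infty} p_{\pi''_k}(y=0|x=0) = \lim_{k\to\infty} p_{\pi''_k}(y=1|x=2) = 1$ and $\lim_{k\to\infty} p_{\pi''_k}(y=0|x=1) =
\lim_{k\to\infty} p_{\pi''_k}(y=1|x=1) = 1/2$. The risk function of the predictive density $\lim_{k\to\infty} p_{\pi''_k}(y|x)$, which is
a limit of the Bayesian predictive densities, is given by

$$R\bigl(\theta, \lim_{k\to\infty} p_{\pi''_k}(y|x)\bigr) =
\begin{cases}
0, & \theta = 0 \in \Theta_q, \\
\infty, & \theta \in (0,1) = \Theta \setminus \Theta_q, \\
0, & \theta = 1 \in \Theta_q,
\end{cases}$$

and coincides with $R(\theta, q(y;x))$. □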
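The limits stated in Example 1 can be checked numerically. The following sketch evaluates the Bayesian predictive density under the mixture of the uniform measure on $[0,1]$ and the two-point prior for a large $k$. The function names and the choice $w = 0.3$ are assumptions made for illustration; the integral over the uniform measure uses the Beta integral $\int_0^1 \theta^a(1-\theta)^b\,d\theta = a!\,b!/(a+b+1)!$.

```python
from math import comb, factorial

def p_joint(x, y, theta):
    """p(x, y | theta) of Example 1: x ~ Binomial(2, theta), y ~ Bernoulli(theta)."""
    return (comb(2, x) * theta**x * (1 - theta)**(2 - x)
            * theta**y * (1 - theta)**(1 - y))

def uniform_marginal(x, y):
    """Integral of p(x, y | theta) over the uniform measure on [0, 1]."""
    a, b = x + y, 3 - x - y  # exponents of theta and (1 - theta)
    return comb(2, x) * factorial(a) * factorial(b) / factorial(a + b + 1)

def predictive(k, w=0.3):
    """Bayesian predictive density under (1/k) * uniform + (1 - 1/k) * two-point prior."""
    eps = 1.0 / k
    table = []
    for x in range(3):
        row = [eps * uniform_marginal(x, y)
               + (1 - eps) * (w * p_joint(x, y, 0.0) + (1 - w) * p_joint(x, y, 1.0))
               for y in range(2)]
        total = sum(row)  # marginal probability of x, positive for every k
        table.append([value / total for value in row])
    return table

for x, row in enumerate(predictive(10**6)):
    print(x, row)
```

For $x = 1$ only the uniform component contributes, so the conditional density does not depend on $k$; for $x = 0$ and $x = 2$ the two-point prior dominates as $k$ grows.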



Example 2. Suppose that $\mathcal{X} = \{0,1,2\}$, $\mathcal{Y} = \{0,1\}$, $\Theta = \{\theta_1, \theta_2\}$, $p((2,0)|\theta_1) = p((2,1)|\theta_1) = 0$,
$p((0,0)|\theta_1) = p((1,1)|\theta_1) = 1/3$, $p((0,1)|\theta_1) = p((1,0)|\theta_1) = 1/6$, $p((2,0)|\theta_2) = p((2,1)|\theta_2) =
(1-\varepsilon)/2$, and $p((0,0)|\theta_2) = p((0,1)|\theta_2) = p((1,0)|\theta_2) = p((1,1)|\theta_2) = \varepsilon/4$, where $0 < \varepsilon < 1$.

Consider a predictive density defined by $q(y=0;x=0) = q(y=1;x=1) = 2/3$, $q(y=1;x=0) =
q(y=0;x=1) = 1/3$, $q(y=0;x=2) = 1/3$, and $q(y=1;x=2) = 2/3$. Then,
$\mathcal{N}_q = \emptyset$, $\Theta_q = \Theta$, $\mathcal{P}_q = \mathcal{P}$, and $\mathcal{X}_q = \mathcal{X}$.

Then, $\hat\pi = \delta_{\theta_1}$ satisfies $D_q(\hat\pi) = \inf_{\pi \in \mathcal{P}} D_q(\pi) = 0$ because $p(y|x,\theta_1) = q(y;x)$ except for the
case $x = 2$. Since $p(x=2|\theta_1) = 0$, $p_{\hat\pi}(y|x=2)$ is not uniquely determined. Thus, we consider a
limit of Bayesian predictive densities.

Put $\mu = \delta_{\theta_1}/2 + \delta_{\theta_2}/2$. It can be easily verified that $\pi_n = (1/n)\mu + (1 - 1/n)\delta_{\theta_1}$ satisfies
$D_q(\pi_n) = \inf_{\pi \in \mathcal{P}_{\mu/n}} D_q(\pi)$. Then, $\lim_{n\to\infty} p_{\pi_n}(y|x=0) = p(y|x=0,\theta_1) = q(y;x=0)$, $\lim_{n\to\infty} p_{\pi_n}(y|x=1)
= p(y|x=1,\theta_1) = q(y;x=1)$, and $p_{\pi_n}(y|x=2) = p(y|x=2,\theta_2) \ne q(y;x=2)$. By calculation,
we have $R(\theta_1, \lim_{n\to\infty} p_{\pi_n}(y|x)) = R(\theta_1, q(y;x)) = 0$ and $R(\theta_2, \lim_{n\to\infty} p_{\pi_n}(y|x)) = (\varepsilon/2)\log(9/8) <
R(\theta_2, q(y;x)) = (1/2)\log(9/8)$. Thus, the performance of $\lim_{n\to\infty} p_{\pi_n}(y|x)$ is better than that of
$q(y;x)$. □
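The risk values quoted in Example 2 can be reproduced directly from the tables. In the sketch below, the value `eps = 0.2` is an arbitrary illustrative choice in $(0,1)$, and the names `tables`, `q`, `limit`, and `risk` are hypothetical; `limit` encodes the limit of the Bayesian predictive densities as described in the example.

```python
import math

eps = 0.2  # any value in (0, 1); chosen only for illustration

# tables[name][x][y] = p((x, y) | theta) for the two parameter values.
tables = {
    "theta1": [[1 / 3, 1 / 6], [1 / 6, 1 / 3], [0.0, 0.0]],
    "theta2": [[eps / 4, eps / 4], [eps / 4, eps / 4],
               [(1 - eps) / 2, (1 - eps) / 2]],
}
q = [[2 / 3, 1 / 3], [1 / 3, 2 / 3], [1 / 3, 2 / 3]]      # given predictive density
limit = [[2 / 3, 1 / 3], [1 / 3, 2 / 3], [1 / 2, 1 / 2]]  # limit of Bayesian predictive densities

def risk(table, pred):
    """Average Kullback-Leibler risk; cells with zero probability contribute 0."""
    total = 0.0
    for x, row in enumerate(table):
        p_x = sum(row)
        for y, p_xy in enumerate(row):
            if p_xy > 0.0:
                total += p_xy * math.log((p_xy / p_x) / pred[x][y])
    return total

for name, table in tables.items():
    print(name, risk(table, limit), risk(table, q))
print((eps / 2) * math.log(9 / 8), 0.5 * math.log(9 / 8))
```

Both predictive densities have zero risk at $\theta_1$, and the two numbers printed on the last line match the risks at $\theta_2$ up to rounding.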
3. Latent information priors and minimax prediction 

In this section, we construct minimax predictive densities that are limits of Bayesian predictive 
densities based on prior sequences converging to latent information priors defined below. 
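A predictive density $q(y;x)$ is said to be minimax if it satisfies the equality

$$\sup_{\theta \in \Theta} \sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{q(y;x)}
= \inf_{\tilde q} \sup_{\theta \in \Theta} \sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{\tilde q(y;x)}.$$

The conditional mutual information between $y$ and $\theta$ given $x$ is defined by

$$\begin{aligned}
I_{\theta,y|x}(\pi) :={}& \int \sum_{x,y} p(x,y|\theta) \log p(x,y|\theta)\,d\pi(\theta) - \sum_{x,y} p_\pi(x,y) \log p_\pi(x,y) \\
& - \int \sum_{x} p(x|\theta) \log p(x|\theta)\,d\pi(\theta) + \sum_{x} p_\pi(x) \log p_\pi(x),
\end{aligned}$$

which is a function of $\pi \in \mathcal{P}$. If $p_\pi(x) > 0$ for all $x \in \mathcal{X}$, then

$$I_{\theta,y|x}(\pi) = \int \sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{p_\pi(y|x)}\,d\pi(\theta).$$

Since $u \log u$ $(0 \le u \le 1)$ is a bounded continuous function, $I_{\theta,y|x}(\pi)$ is a bounded continuous
function of $\pi \in \mathcal{P}$.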
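For a finite parameter set, the four-term definition of the conditional mutual information can be evaluated directly. The sketch below is illustrative only: the function `cond_mutual_info`, the layout `model[t][x][y]`, and the toy model are assumptions and are not taken from the paper. In the toy model, $x$ carries no information about $\theta$ while $y$ reveals it exactly, so the value under the uniform prior on the two parameter values should be $\log 2$.

```python
import math

def xlogx(u):
    """u * log(u) with the convention 0 * log(0) = 0."""
    return 0.0 if u == 0.0 else u * math.log(u)

def cond_mutual_info(model, prior):
    """Conditional mutual information between y and theta given x.

    model[t][x][y] = p(x, y | theta_t); prior[t] = prior weight of theta_t.
    """
    n_x, n_y = len(model[0]), len(model[0][0])
    value = 0.0
    # Prior averages of sum p(x,y|theta) log p(x,y|theta) and sum p(x|theta) log p(x|theta).
    for weight, table in zip(prior, model):
        value += weight * sum(xlogx(p) for row in table for p in row)
        value -= weight * sum(xlogx(sum(row)) for row in table)
    # Terms built from the marginals p_pi(x, y) and p_pi(x).
    for x in range(n_x):
        mix_row = [sum(w * t[x][y] for w, t in zip(prior, model)) for y in range(n_y)]
        value -= sum(xlogx(p) for p in mix_row)
        value += xlogx(sum(mix_row))
    return value

# Toy model: x is a fair coin under both parameter values, y identifies theta.
model = [
    [[0.5, 0.0], [0.5, 0.0]],  # theta_1: y = 0 with probability 1
    [[0.0, 0.5], [0.0, 0.5]],  # theta_2: y = 1 with probability 1
]
print(cond_mutual_info(model, [0.5, 0.5]), math.log(2))
print(cond_mutual_info(model, [1.0, 0.0]))
```

A prior concentrated on a single parameter value leaves nothing to learn about $\theta$, which is why the second call should return a value equal to zero up to rounding.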

We define a latent information prior as a prior $\hat\pi$ that satisfies $I_{\theta,y|x}(\hat\pi) = \sup_{\pi \in \mathcal{P}} I_{\theta,y|x}(\pi)$.

Intuitively speaking, under the latent information prior, the parameter $\theta$ has the maximum
information about the future observable $y$ under the condition that $x$ is observed. Therefore, $\theta$
has the maximum amount of "latent" information, which we cannot observe through the data
$x$. Thus, the latent information prior corresponds to the "worst case" and is naturally related
to minimaxity. On the other hand, the minimum information prior discussed by Akaike (1983)
is a prior maximizing the mutual information between the future observable $y$ and the data $x$.
This prior corresponds to the "best case" and is far from minimaxity.

The priors $\pi_\infty$ and $\hat\pi$ in Theorem 2 below are latent information priors.

Theorem 2.

1) There exists a convergent prior sequence $\{\pi_n\}_{n=1}^{\infty}$ such that $\lim_{n\to\infty} p_{\pi_n}(y|x)$ is a minimax
predictive density and the equality $I_{\theta,y|x}(\pi_\infty) = \sup_{\pi \in \mathcal{P}} I_{\theta,y|x}(\pi)$ holds, where $\pi_\infty = \lim_{n\to\infty} \pi_n$.

2) Let $\hat\pi \in \mathcal{P}$ be a prior maximizing $I_{\theta,y|x}(\pi)$. If $p_{\hat\pi}(x) > 0$ for all $x \in \mathcal{X}$, then $p_{\hat\pi}(y|x)$ is a
minimax predictive density. □

Proof. 1) Let $\mu$ be a probability measure on $\Theta$ such that $p_\mu(x) := \int p(x|\theta)\,d\mu(\theta) > 0$ for every
$x \in \mathcal{X}$, and let $\pi_n \in \mathcal{P}_{\mu/n} := \{\mu/n + (1 - 1/n)\pi \mid \pi \in \mathcal{P}\}$ be a prior satisfying $I_{\theta,y|x}(\pi_n) =
\sup_{\pi \in \mathcal{P}_{\mu/n}} I_{\theta,y|x}(\pi)$. From Lemma 2, there exists a convergent subsequence $\{\pi'_m\}_{m=1}^{\infty}$ of $\{\pi_n\}_{n=1}^{\infty}$
such that $I_{\theta,y|x}(\pi'_\infty) = \sup_{\pi \in \mathcal{P}} I_{\theta,y|x}(\pi)$, where $\pi'_\infty = \lim_{m\to\infty} \pi'_m$. Let $n_m$ be the integer satisfying
$\pi'_m = \pi_{n_m}$. As in the proof of Theorem 1, we can take a subsequence $\{\pi'_m\}_{m=1}^{\infty}$ such that
$0 < n_m/(n_{m+1} - n_m) < c$ for some positive constant $c$.
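Then, for every $\theta \in \Theta$,

$$\tilde\pi_{m,\theta,u} := u\left\{\frac{n_m}{n_{m+1}}\,\pi'_m + \Bigl(1 - \frac{n_m}{n_{m+1}}\Bigr)\delta_\theta\right\} + (1-u)\,\pi'_{m+1}$$

belongs to $\mathcal{P}_{\mu/n_{m+1}}$ for $0 \le u \le 1$, because $(n_m/n_{m+1})\pi'_m + (1 - n_m/n_{m+1})\delta_\theta \in \mathcal{P}_{\mu/n_{m+1}}$ and
$\pi'_{m+1} \in \mathcal{P}_{\mu/n_{m+1}}$. Thus,

$$\begin{aligned}
\frac{\partial}{\partial u} I_{\theta,y|x}(\tilde\pi_{m,\theta,u})\Big|_{u=0}
&= \frac{n_m}{n_{m+1}} \int \sum_{x,y} p(x,y|\theta) \log p(x,y|\theta)\,d\pi'_m(\theta)
 + \Bigl(1 - \frac{n_m}{n_{m+1}}\Bigr) \sum_{x,y} p(x,y|\theta) \log p(x,y|\theta) \\
&\quad - \int \sum_{x,y} p(x,y|\theta) \log p(x,y|\theta)\,d\pi'_{m+1}(\theta)
 - \sum_{x,y} \frac{\partial}{\partial u}\, p_{\tilde\pi_{m,\theta,u}}(x,y)\Big|_{u=0} \log p_{\pi'_{m+1}}(x,y) \\
&\quad - \frac{n_m}{n_{m+1}} \int \sum_{x} p(x|\theta) \log p(x|\theta)\,d\pi'_m(\theta)
 - \Bigl(1 - \frac{n_m}{n_{m+1}}\Bigr) \sum_{x} p(x|\theta) \log p(x|\theta) \\
&\quad + \int \sum_{x} p(x|\theta) \log p(x|\theta)\,d\pi'_{m+1}(\theta)
 + \sum_{x} \frac{\partial}{\partial u}\, p_{\tilde\pi_{m,\theta,u}}(x)\Big|_{u=0} \log p_{\pi'_{m+1}}(x) \ \le\ 0,
\end{aligned}$$

where we used

$$\frac{\partial}{\partial u}\, p_{\tilde\pi_{m,\theta,u}}(x,y)\Big|_{u=0}
= \frac{n_m}{n_{m+1}}\, p_{\pi'_m}(x,y) + \Bigl(1 - \frac{n_m}{n_{m+1}}\Bigr) p(x,y|\theta) - p_{\pi'_{m+1}}(x,y),
\qquad
\sum_{x,y} \frac{\partial}{\partial u}\, p_{\tilde\pi_{m,\theta,u}}(x,y)\Big|_{u=0} = 0.$$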

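Noting that $p_{\pi'_m}(x) > 0$ for every $m$ and $x \in \mathcal{X}$ and that $p(x,y|\theta) \log p(y|x,\theta) = 0$ if
$p(x|\theta) = 0$, we have

$$\begin{aligned}
&\frac{n_m}{n_{m+1}} \int \sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{p_{\pi'_{m+1}}(y|x)}\,d\pi'_m(\theta)
 + \Bigl(1 - \frac{n_m}{n_{m+1}}\Bigr) \sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{p_{\pi'_{m+1}}(y|x)} \\
&\quad - \int \sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{p_{\pi'_{m+1}}(y|x)}\,d\pi'_{m+1}(\theta) \ \le\ 0.
\end{aligned}$$

Hence,

$$\begin{aligned}
\sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{p_{\pi'_{m+1}}(y|x)}
&\le - \frac{n_m}{n_{m+1} - n_m} \Biggl\{ \int \sum_{(x,y) \notin \mathcal{N}_{\pi'_\infty}} p(x,y|\theta) \log \frac{p(y|x,\theta)}{p_{\pi'_{m+1}}(y|x)}\,d\pi'_m(\theta) \\
&\qquad + \int \sum_{(x,y) \in \mathcal{N}_{\pi'_\infty}} p(x,y|\theta) \log p(y|x,\theta)\,d\pi'_m(\theta)
 - \int \sum_{(x,y) \in \mathcal{N}_{\pi'_\infty}} p(x,y|\theta) \log p_{\pi'_{m+1}}(y|x)\,d\pi'_m(\theta) \Biggr\} \\
&\quad + \frac{n_{m+1}}{n_{m+1} - n_m} \int \sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{p_{\pi'_{m+1}}(y|x)}\,d\pi'_{m+1}(\theta) \\
&\le - \frac{n_m}{n_{m+1} - n_m} \Biggl\{ \int \sum_{(x,y) \notin \mathcal{N}_{\pi'_\infty}} p(x,y|\theta) \log \frac{p(y|x,\theta)}{p_{\pi'_{m+1}}(y|x)}\,d\pi'_m(\theta)
 + \int \sum_{(x,y) \in \mathcal{N}_{\pi'_\infty}} p(x,y|\theta) \log p(y|x,\theta)\,d\pi'_m(\theta) \Biggr\} \\
&\quad + \frac{n_{m+1}}{n_{m+1} - n_m} \int \sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{p_{\pi'_{m+1}}(y|x)}\,d\pi'_{m+1}(\theta),
\end{aligned} \tag{8}$$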

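where $\mathcal{N}_{\pi'_\infty} := \{(x,y) \in \mathcal{X} \times \mathcal{Y} \mid p_{\pi'_\infty}(x,y) = 0\}$. Here, we have

$$\lim_{m\to\infty} \int \sum_{(x,y) \notin \mathcal{N}_{\pi'_\infty}} p(x,y|\theta) \log \frac{p(y|x,\theta)}{p_{\pi'_{m+1}}(y|x)}\,d\pi'_m(\theta)
= \int \sum_{(x,y) \notin \mathcal{N}_{\pi'_\infty}} p(x,y|\theta) \log \frac{p(y|x,\theta)}{p_{\pi'_\infty}(y|x)}\,d\pi'_\infty(\theta) \tag{9}$$

and

$$\lim_{m\to\infty} \int \sum_{(x,y) \in \mathcal{N}_{\pi'_\infty}} p(x,y|\theta) \log p(y|x,\theta)\,d\pi'_m(\theta)
= \int \sum_{(x,y) \in \mathcal{N}_{\pi'_\infty}} p(x,y|\theta) \log p(y|x,\theta)\,d\pi'_\infty(\theta) \tag{10}$$

because $p(x,y|\theta) \log p(x,y|\theta)$ and $p(x|\theta) \log p(x|\theta)$ are bounded continuous functions of $\theta$ for
every fixed $(x,y)$.

From (8), (9), (10), and $0 < n_m/(n_{m+1} - n_m) < c$, we have, for every $\theta \in \Theta$,

$$\limsup_{m\to\infty} \sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{p_{\pi'_{m+1}}(y|x)}
\le \int \sum_{(x,y) \notin \mathcal{N}_{\pi'_\infty}} p(x,y|\theta) \log \frac{p(y|x,\theta)}{p_{\pi'_\infty}(y|x)}\,d\pi'_\infty(\theta).$$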



By taking an appropriate subsequence $\{\pi''_k\}_{k=1}^{\infty}$ of $\{\pi'_m\}_{m=1}^{\infty}$, we can make $\{p_{\pi''_k}(y|x)\}_{k=1}^{\infty}$
converge for every $(x,y)$ as $k \to \infty$. Then, for every $\theta \in \Theta$,

$$\sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{\lim_{k\to\infty} p_{\pi''_k}(y|x)}
\le \int \sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{\lim_{k\to\infty} p_{\pi''_k}(y|x)}\,d\pi''_\infty(\theta), \tag{11}$$

where $\pi''_\infty = \pi'_\infty = \lim_{k\to\infty} \pi''_k$, because $\lim_{k\to\infty} p_{\pi''_k}(y|x) = p_{\pi''_\infty}(y|x)$ for $x$ with $p_{\pi''_\infty}(x) > 0$.

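On the other hand, we have

$$\begin{aligned}
\int \sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{\lim_{k\to\infty} p_{\pi''_k}(y|x)}\,d\pi''_\infty(\theta)
&= \inf_{q} \int \sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{q(y;x)}\,d\pi''_\infty(\theta) \\
&\le \sup_{\pi \in \mathcal{P}} \inf_{q} \int \sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{q(y;x)}\,d\pi(\theta) \\
&\le \inf_{q} \sup_{\pi \in \mathcal{P}} \int \sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{q(y;x)}\,d\pi(\theta) \\
&= \inf_{q} \sup_{\theta \in \Theta} \sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{q(y;x)} \\
&\le \sup_{\theta \in \Theta} \sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{\lim_{k\to\infty} p_{\pi''_k}(y|x)}.
\end{aligned} \tag{12}$$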

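The first equality is because the Bayes risk

$$\int R(\theta, q(y;x))\,\pi''_\infty(d\theta) = \int \sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{q(y;x)}\,\pi''_\infty(d\theta)$$

with respect to $\pi''_\infty \in \mathcal{P}$ is minimized when

$$q(y;x) = p_{\pi''_\infty}(y|x) := \frac{\int p(x,y|\theta)\,\pi''_\infty(d\theta)}{\int p(x|\theta)\,\pi''_\infty(d\theta)};$$

see Aitchison (1975). Although $p_{\pi''_\infty}(y|x)$ is not uniquely determined for $x$ with $p_{\pi''_\infty}(x) = 0$, the
Bayes risk does not depend on the choice of $p_{\pi''_\infty}(y|x)$ for such $x$.
From (11) and (12), we have

$$\inf_{q} \sup_{\theta \in \Theta} \sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{q(y;x)}
= \sup_{\theta \in \Theta} \sum_{x,y} p(x,y|\theta) \log \frac{p(y|x,\theta)}{\lim_{k\to\infty} p_{\pi''_k}(y|x)}.$$

Therefore, the predictive density $\lim_{k\to\infty} p_{\pi''_k}(y|x)$ is minimax.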

2) In this case, the proof becomes much simpler. By setting $\mu = \hat\pi$ in the proof of 1), we have
$\pi_n = \hat\pi$ $(n = 1,2,3,\ldots)$. Thus, $\lim_{n\to\infty} p_{\pi_n}(y|x) = p_{\hat\pi}(y|x)$, and the desired result can be proved
without considering limits of Bayesian predictive densities. □



4. Numerical results and discussions 

Let $p(x|\theta) = \binom{N}{x}\theta^x(1-\theta)^{N-x}$ $(x = 0,1,\ldots,N)$, $p(y|\theta) = \binom{M}{y}\theta^y(1-\theta)^{M-y}$ $(y = 0,1,\ldots,M)$,
and $\Theta = \{0.1k \mid k = 0,1,2,\ldots,10\}$ in which $\theta$ takes a value. Although this example is relatively
simple in the sense that $x$ and $y$ are independent given $\theta$, the behavior of priors is not trivial.

The latent information priors, which maximize $I_{\theta,y|x}(\pi)$, for 16 sets of values of $(N,M)$ are
obtained numerically; see Figure 1.

The prior for $(N,M) = (0,1000)$ is almost uniform and is similar to the reference prior
because the reference prior is the latent information prior with $N = 0$ and $M \to \infty$. It is
widely known that the reference prior is uniform when the parameter space is a finite set. The
latent information prior for $(N,M) = (0,100)$ is similar to the histogram of the Jeffreys prior
density $\theta^{-1/2}(1-\theta)^{-1/2}/B(1/2,1/2)$ for the binomial model with the ordinary parameter space
$\Theta = [0,1]$.

When both $N$ and $M$ are small, the priors assign weights only on a limited number of
points in $\Theta$. This corresponds to the phenomenon concerning the $k$-reference prior studied by
Berger, Bernardo, and Mendoza (1989). The $k$-reference prior is the latent information prior
with $N = 0$ and $M = k$.

Figure 1. Latent information priors for various (N,M) values

When $N$ is large, the priors assign more weights to parameter values close to 0.5. The shapes
of priors are quite different from the uniform density or the histogram of the Jeffreys prior for
the binomial model with the ordinary parameter space $\Theta = [0,1]$.

These observations show that the latent information priors strongly depend on (N, M). This 
indicates that we need to abandon the context invariance (see Dawid (1983)) of priors. 

The relation between the conditional mutual information and predictive densities parallels
that between the unconditional mutual information and Bayes codes in information theory
except for the care needed for the case $p_\pi(x) = 0$. Many studies on the unconditional mutual
information and minimax prediction and coding have been carried out; see, for example, Ibragimov
and Hasminskii (1972), Gallager (1979), Davisson and Leon-Garcia (1980), Clarke and Barron
(1994), and Haussler (1997). See also Grünwald and Dawid (2004) for discussions in a very
general setting. The conditional mutual information $I_{\theta,y|x}(\pi)$ coincides with the Bayes risk of
the Bayesian predictive density based on $\pi$. Therefore, it is natural that the prior maximizing
$I_{\theta,y|x}(\pi)$ corresponds to minimax prediction based on data.
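The coincidence of the conditional mutual information with the Bayes risk of the Bayesian predictive density can be checked numerically on the binomial model of this section. The sketch below uses $(N,M) = (2,3)$ and the uniform prior on the eleven grid points purely as illustrative choices; the function names are hypothetical, and the code does not attempt to find the maximizing prior.

```python
import math
from math import comb

def binom_pmf(n, k, theta):
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def xlogx(u):
    return 0.0 if u == 0.0 else u * math.log(u)

def model_tables(n_obs, n_future, thetas):
    """tables[t][x][y] = p(x | theta_t) * p(y | theta_t)."""
    return [[[binom_pmf(n_obs, x, th) * binom_pmf(n_future, y, th)
              for y in range(n_future + 1)]
             for x in range(n_obs + 1)] for th in thetas]

def mixture(model, prior):
    """Marginal table p_pi(x, y) under the prior."""
    n_x, n_y = len(model[0]), len(model[0][0])
    return [[sum(w * t[x][y] for w, t in zip(prior, model)) for y in range(n_y)]
            for x in range(n_x)]

def cond_mutual_info(model, prior):
    """Four-term definition of the conditional mutual information."""
    mix = mixture(model, prior)
    value = -sum(xlogx(p) for row in mix for p in row) + sum(xlogx(sum(row)) for row in mix)
    for w, table in zip(prior, model):
        value += w * (sum(xlogx(p) for row in table for p in row)
                      - sum(xlogx(sum(row)) for row in table))
    return value

def bayes_risk(model, prior):
    """Prior average of the Kullback-Leibler risk of the Bayesian predictive density."""
    mix = mixture(model, prior)
    total = 0.0
    for w, table in zip(prior, model):
        for x, row in enumerate(table):
            p_x, mix_x = sum(row), sum(mix[x])
            for y, p_xy in enumerate(row):
                if p_xy > 0.0:
                    total += w * p_xy * math.log((p_xy / p_x) / (mix[x][y] / mix_x))
    return total

thetas = [k / 10 for k in range(11)]
prior = [1 / 11] * 11
model = model_tables(2, 3, thetas)
info = cond_mutual_info(model, prior)
risk = bayes_risk(model, prior)
print(info, risk)
```

The two printed numbers should agree up to rounding, since the uniform prior on the grid gives every $x$ positive marginal probability.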

In general, the priors based on the unconditional mutual information and those based on the
conditional mutual information are quite different. Latent information priors maximizing the
conditional mutual information could play important roles in statistical applications. Although
we have discussed submodels of multinomial models, the essential part of our discussion seems to hold
for more general models under suitable regularity conditions including compactness of the model
as in the theory based on the unconditional mutual information studied by Haussler (1997).

The explicit forms of latent information priors are usually complex and difficult to obtain
unless the parameter space is finite. For actual applications, it is important to develop
approximation methods and asymptotic theory in various settings other than the situation $N = 0$, $M \to \infty$
studied in the reference analysis. When $I_{\theta,y|x}(\pi)$ is close to $I_{\theta,y|x}(\hat\pi)$, a prior $\pi$ is considered to
be close to $\hat\pi$ because $I_{\theta,y|x}(\pi)$ is a concave function of $\pi$. These topics require further research
and will be discussed in other places.